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More than half of Caenorhabditis elegans pre-mRNAs lose their original 5' ends in a process termed "fra/w-splicing" in which 
the RNA extending from the transcription start site (TSS) to the site of fra/75-spIicing of the primary transcript, termed the 
"outron," is replaced with a 22-nt spliced leader. This complicates the mapping of TSSs, leading to a lack of available TSS 
mapping data for these genes. We used growth at low temperature and nuclear isolation to enrich for transcripts still 
containing outrons, applying a modified SAGE capture procedure and high-throughput sequencing to characterize 5' 
termini in this transcript population. We report from this data both a landscape of 5'-end utilization for C. elegans and 
a representative collection of TSSs for 7351 fra/?$-spIiced genes. TSS distributions for individual genes were often dispersed, 
with a greater average number of TSSs for fra/w-spliced genes, suggesting that fra/w-splicing may remove selective pressure 
for a single TSS. Upstream of newly defined TSSs, we observed well-known motifs [including TATAA-box and SP1) as well 
as novel motifs. Several of these motifs showed association with tissue-specific expression and /or conservation among six 
worm species. Comparing TSS features between fra/?$-spIiced and non-fra/?$-spIiced genes, we found stronger signals among 
outron TSSs for preferentially positioning of flanking nucleosomes and for downstream Pol II enrichment. Our data 
provide an enabling resource for both experimental and theoretical analysis of gene structure and function in C elegans. 



[Supplemental material is available for this article.] 

In most organisms, the locations of transcription start sites (TSSs) 
can be determined by establishing the sequences of mRNA 5 ' ends. 
For a subset of single-cell eukaryotes and animals, however, a pro- 
cessing event known as "trans-splicing" obscures the locations of 
the TSSs (Sutton and Boothroyd 1986; Krause and Hirsh 1987; 
Lasda and Blumenthal 2011). Trans-splicing is an efficient process 
that results in removal of the 5' end of the pre-mRNA, replacing it 
with a short 5' leader that is then retained on the mature mRNA. 
The removed sequence (between the TSS and the first active 3' 
splice site in the newly transcribed mRNA precursor) is called the 
"outron" (Conrad et al. 1991). Trans-splicing is a spliceosome- 
catalyzed process that can be thought of as a surrogate use 
of a mobile 5' splice site contained on a short (usually —100 nt) 
RNA called an SL RNA. SL RNAs are present in the nucleus in the 
form of Sm protein-bound small nuclear ribonucleoprotein parti- 
cles (snRNPs), where the 5' 22 nt forms the SL exon, with the 3' 
portion serving as "intron." 

Caenorhabditis elegans has been a valuable model system for 
studying a variety of aspects of gene expression (including trans- 
splicing) (Krause and Hirsh 1987; Lasda and Blumenthal 2011). 
Approximately 70% of C. elegans genes are subject to trans-splicing 
(Allen et al. 2011; Lasda and Blumenthal 2011). C. elegans has two 
types of trans-splicing, each with a distinct splice leader (SL1 and 
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SL2) (Spieth et al. 1993). SL1 is the more common splice leader, 
with 55% of C. elegans genes spliced to SL1 (Allen et al. 2011). SL1 
trans-spliced genes are either singly transcribed genes or the first 
genes in cotranscribed operons. The SL1 trans-spliced genes are 
thus those for which the outron 5' ends correspond to TSS posi- 
tions. About 15% of all genes are trans-spliced by the spliced leader 
SL2 or one of its close relatives (Allen et al. 2011). SL2 trans-spliced 
genes are always downstream in operons, so they generally do not 
have TSSs present just 5' of the gene body. Due to the rapidity of 
trans-splicing and the overwhelming predominance of post-trans- 
splicing forms among stable cytoplasmic RNAs, the annotation of 
the worm genome has lacked a global description of TSSs, con- 
founding analysis of ds-regulatory sequences, chromatin config- 
urations around promoters, and identification of specialized fea- 
tures such as bidirectional promoters. 

Here we perform a 5 '-targeted analysis on nuclear RNA from 
C. elegans grown under conditions that slow the processing of 
outrons, enabling the mapping of a large number of TSS for both 
SL1 trans-spliced and non- trans-spliced genes and a consequent 
picture of transcriptional initiation repertoire in this system. 

Results 

Capture of 5'-end RNA reads from nuclei 

Animals grown at the low end of the normal temperature range 
were used to isolate potential pre-trans-splicing intermediates. 
The choice of lower growth temperatures to support identifica- 
tion of pre-spliced intermediates follows observations that (1) 
C. elegans strains partially defective in key trans-splicing RNP factors 
show a cold sensitivity in organismal phenotype (MacMorris et al. 
2007), and (2) splicing can be mechanistically sensitive to cold 
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temperatures as a result of numerous base-pairing interactions that 
must be remodeled during the process (Strauss and Guthrie 1991). 

To identify TSSs that are processed before trans-splicing or are 
free of trans-spliced leaders, we grew populations of animals at 
16°C, isolated nuclei from embryos and whole adult tissues, and 
extracted total nuclear RNA. To minimize contaminating ribo- 
somal RNA (rRNA) molecules, we used hybrid subtraction-based 
enrichment of RNA with the RiboMinus RNA Kit before sequenc- 
ing (Ruan et al. 2004; Chen and Duan 2011). We then used a 5'- 
SAGE method (Fig. 1A) to capture 5' ends, including DNase treat- 
ment and a phosphatase treatment preceding cap removal and 



linker ligation that strongly enriches for RNA molecules where a 
5' phosphate is only revealed following decapping (protocol de- 
scribed in Hashimoto et al. 2009). Illumina sequencing was then 
carried out, resulting in approximately 119 million and 76 million 
76-bp single-end reads collected from the nuclei of embryo and 
adult samples, respectively (see Methods). These reads were aligned 
to the C. elegans genome WS220 (Hillier et al. 2009; Harris et al. 
2010) using BWA (Li and Durbin 2009) (see Methods). From this 
analysis, approximately 55.6 million and 30.4 million reads, re- 
spectively, were anchored to unique positions (Table 1), with the 
approximately 55.6 million reads from embryo samples used as the 
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Figure 1 . Identification of TSSs from alignments of 5'-end reads. (A) Process for 5'-end library construction for Illumina GA sequencing. See details in 
Methods. (B) Frequency distribution of reads aligned to genomic locations upstream of NM_058956 (phosphatase). Three peaks were calculated as the 
best-fitting set of Gaussian distributions according to Bayesian information criteria. (C) Three classes of genes with different patterns of Gaussian TSS 
distributions in the regions of 3000 bp upstream of the translation initiation sites. (Top figure) Case in which both Gaussian outron/exon-TSSs occur 
because of the presence of an SL1 site between the two classes of Gaussian TSSs. (Middle and bottom figures) Cases in which all Gaussian TSSs are exon-TSSs 
and outron-TSSs, respectively. (Vertical bars with asterisk) Representative TSSs with the maximum number of aligned 5'-end tags. (Right bars) The 
numbers of genes in the three classes. As shown in the bottom, alignments of paired-end reads were useful in linking representative TSSs to their gene 
bodies over SL1 sites. (D) Frequency of Gaussian exon/outron-TSSs. (f ) 5'-End reads excluding rRNA reads were categorized into the four groups of reads 
that were mapped onto promoters of outron/exon-TSSs, exons, introns, and intergenic regions, respectively. The Venn diagram illustrates the relationship 
among the four groups. (F) Frequency distribution of distances between 5'-exon boundaries from WS220 (Hillier et al. 2009; Harris et al. 2010) and 
proximal exon-TSSs identified in our data. (C) Examples of 5' capture data for genes that have been previous subjects of TSS mapping. (Above) Exon 5' end 
for the gene myo-1 with original 5' region having four candidate 5' ends (from SI nuclease mapping with accuracy to within a few bp) (Okkema et al. 
1 993) marked with green arrows. (Below) Outron 5' region for rpl-2 with original 5' end (from 5' RACE mapping with accuracy to within 1-2 bp) (Sleumer 
et al. 201 2) highlighted by the green arrow. (H) Frequency distribution of the distances between SL1 sites and most abundant representative outron-TSSs 
supported by 1 0 or more single-end reads. (Left inset) This magnifies the region ranging from 1 0 to 1 00 bp; (right inset) the 90th, 75th, 25th, and 1 0th 
percentiles and the median of outron lengths. (/) Numbers of TSS clusters categorized into NP, BP, and WP types in terms of five values of the minimum 
threshold on the size of TSS clusters. Because multiple TSS clusters may be associated with each gene, the second vertical axis shows the numbers of genes 
involved. Observe that TSS clusters of WP type are dominant for threshold >25. 
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Table 1. Statistics of sequenced 5-end reads from nuclei of 
embryo and adult samples and their alignment to the worm 
genome (WS220) 
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90,575,557 
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34,967,142 


(single-end) 










Adult 


76,731,798 


49,634,285 


30,453,251 


19,181,034 


(single-end) 











primary data set for initial TSS identification. As expected from the 
mass of ribosomal RNA present even after (incomplete) hybrid 
subtraction, the pool of sequenced reads still included approxi- 
mately 24.2 million 5 '-end rRNA reads. We removed these after 
their alignment to the genome, leaving a total of 31.4 million reads 
from the embryonic library. 

Selecting representative TSSs is not simple because the dis- 
tribution of TSSs is not necessarily sharp and unique but is often 
broad and/or multi-modal (Suzuki et al. 2001; Schug et al. 2005; 
Carninci et al. 2006; Frith et al. 2006; Saxonov et al. 2006; Zhu et al. 
2008; Rach et al. 2011), as illustrated in Figure IB. We take several 
approaches in working with the multi-modal character of TSS 
distributions. These include both approaches based on calling one 
or a few most prominent starts for each gene as representative and 
approaches based on a base-by-base distribution of TSS frequencies 
covering an extensive upstream region of every gene. 

In calling individual reference starts based on many noisy 
peaks present in the frequency distribution, one approach is to 
model start site regions as a mixture of Gaussian distributions; such 
an approach has been widely used in the recent analysis of TSS 
distributions (Boyle et al. 2008; Ni et al. 2010; Rach et al. 2011). We 
calculated a series of best-fitting Gaussian distributions using 
Bayesian information criteria (see Fig. IB). We then quantified the 
expression level of the resulting Gaussian-based TSS models by the 
number of reads in the corresponding Gaussian window. To 
eliminate noise in genome-wide analysis, we used 73,277 Gaussian 
TSS peaks that were supported by 10 or more reads (see their po- 
sitions in Supplemental Table SI). In addition, we used a noise- 
filtering method using feature density estimator F-Seq (Boyle et al. 
2008) to exclude TSS peaks whose F-Seq values are less than 4 
standard deviations above the mean of the background (F-Seq has 
been widely used in other studies on TSS distributions) (Ni et al. 
2010; Rach et al. 2011). 

The identified "peaks" from the above analysis ranged from 
cases showing a small number of starts in a tight cluster to those 
spread over hundreds of bases of DNA. While the unique-site peaks 
are readily recognizable as a defined entity, parsing of more dis- 
perse TSS regions into individual peaks is by nature dependent 
on the algorithm and parameters. With >80% of the observed 
Gaussian peaks showing a dispersion (standard deviation) of <20 
bp, the most challenging cases are in the minority (for reference, 
Supplemental Table S2 shows a list of Gaussian peaks parsed from 
the data with a maximal standard deviation parameter of 20 bp, 
and Supplemental Fig. SI presents a statistical analysis of Supple- 
mental Table S2). 

We defined outron-TSSs as candidate TSSs by two criteria: (1) 
They had previously mapped downstream SL1 sites (Allen et al. 
201 1); and (2) the TSS occurred in a window from 10 bp to 3000 bp 
upstream of the translation initiation site of the first downstream 



annotated gene (illustrated in Fig. 1C). The 10-bp lower limit was 
chosen to avoid putative outrons that would likely be too short to 
allow trans-splicing (also see below) as well as confounding in- 
stances resulting from short upstream near-matches to SL1 se- 
quence; the 3000-bp upper limit was chosen based on general 
characteristics of C. elegans promoters (Hunt-Newbury et al. 
2007). The term "outron-TSS" is derived from the term "outron" 
(Blumenthal 2005) indicative of the sequence between the nat- 
ural transcription start site and spliced leader attachment site of a 
relevant gene (Blumenthal 2005). We defined exon-TSSs as can- 
didate TSSs with an annotated coding region with no SL1 trans- 
splice sites that were not also intron 3' splice sites. (A low level 
of trans-splicing sometimes occurs at intron 3' splice sites [Allen 
et al. 2011].) Figure ID indicates the incidence of all Gaussian 
outron/exon-TSSs . 

Among 13,336 genes having Gaussian TSS peaks in the region 
of length 3000 bp upstream of their translation initiation sites, 
7351 genes (55.1%) had only Gaussian outron-TSSs and were 
therefore defined as trans-spliced genes, and 4906 genes (36.8%) 
having only Gaussian exon-TSSs were categorized as non-trans- 
spliced genes (Fig. 1C). For the other 1079 genes (8.1%) that had 
both types of Gaussian TSSs in our data, we place the genes in 
neither category, avoiding for this analysis any decision of whether 
they were trans-spliced. For genes that could be defined un- 
ambiguously as trans-spliced or non-trans-spliced, we assigned 
gene-unique representative outron/exon-TSSs with the maximum 
number of aligned reads. This allowed us to pursue further analysis 
in parallel with (1) all Gaussian peaks as TSSs ("The Gaussian Peak 
Collection"), and (2) representative gene-unique TSSs (one for 
each gene). 

To assess the specificity of our 5 '-SAGE method in detecting 
TSSs, we examined the balance between ends present within 
known RNAs and those in regions upstream of annotated genes, 
assigning the approximately 31.4 million embryonic alignments 
to one or more of the categories: promoters of outron/exon-TSSs, 
exons, introns, and intergenic regions. The former three categories 
are overlapping as illustrated by the Venn diagram in Figure IE. We 
compared our results with those by the CAGE method, another 
widely used method of monitoring 5' ends of RNA. The ratio of 5'- 
end tags mapped to promoter regions is 50.2%, while the ratio of 
5 '-end tags by the CAGE method in the human and mouse ge- 
nomes was smaller than 30% (Carninci et al. 2006), indicating that 
the background signal of our 5 '-SAGE method is similar to that of 
CAGE. Although the modest number of 5' ends present in anno- 
tated transcribed regions (21.6% of the total) are under some sus- 
picion as potentially representing capture of fragmented mRNA 
precursors at positions other than the 5' end, the fact that the vast 
majority (78.4%) are in regions with no known abundant tran- 
scripts suggests that the latter RNAs are almost certainly TSSs. In 
particular, we stress that a situation in which 5' ends were captured 
at random from fragmented RNA present in the samples would 
invariably yield a dramatic enrichment for tags within regions 
annotated in other RNA sequencing projects. That this was not the 
case argues strongly for the enrichment in this analysis of bona fide 
capped 5' ends. 

An independent evaluation of 5 '-end mapping was possible 
for known exon-TSSs that are free from trans-splicing. We first 
evaluated the correspondence of 5' mapping for several C. elegans 
genes in which single-gene studies (generally individual primer 
extension reactions either with a sequencing endpoint or with 
a PCR/sequencing endpoint, or nuclease protection mapping) had 
defined an exon 5' end within a few nucleotides. Although not 
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large in number, such specific examples served as valuable 
benchmarks. For the set of genes evaluated (unc-54, myo-1, myo-2, 
sqt-1, col-12), the major 5' ends defined in earlier studies were 
within a few base pairs of prevalent 5 ' capture sites from this work 
(Krause and Hirsh 1987; Dibb et al. 1989; Park and Kramer 1990, 
1994; Okkema et al. 1993). In addition to this individual gene 
analysis, we examined 5268 presumed non- trans-spliced first-exon 
5' boundaries that were derived from whole RNA-seq data in 
WS220 (Hillier et al. 2009; Harris et al. 2010) and that reside >100 
bp upstream of ATG sites of 1078 genes, analyzing the frequency 
distribution of distances between the known 5' ends and their 
proximal exon-TSSs in our data. This data set would be expected to 
be highly enriched for exon TSS positions. As would be expected 
from concordance of the two data sets, we observed a remarkable 
peak at zero in the distance distribution, confirming the significant 
concordance between known 5' ends of mRNAs and our exon-TSSs 
(Fig. IF). 

Existing outron-TSS mapping studies are much rarer in 
C. elegans, as noted above, due to the challenges inherent in cap- 
turing a transient 5 '-end sequence. We know of five genes that 
have been mapped at high precision through primer extension 
(rol-6 and col-13) (Park and Kramer 1990, 1994), 5'-anchored PCR 
and sequencing (rpl-2 and ace-1) (Culetto et al. 1999; Sleumer et al. 
2012), or SI nuclease protection (ubq-1). In our 5 '-end repertoires, 
we see support for the mapped outron-TSSs for rpl-2 (Fig. 1G), col- 
13, and for two of three 5' ends mapped for ace-1 outron se- 
quences. Rol-6 expression is very low in the samples used for this 
work, presumably due to the precise staging of transcription. Thus, 
little or no signal is seen in this study for any sequence upstream of 
rol-6. For ubq-1, we observe a different 5' end from that mapped 
in the original study by SI nuclease protection. (The prominent 
5' end inferred from the original study was at -455 nt. We ob- 
served a strong end at -294, in a region of the gene that was 
electrophoresed out of the pictured gel in the SI mapping study. 
That study was also carried out with total RNA from heat-shocked 
embryos, while we used nuclear RNA from embryos grown at 
the lower end of the C. elegans' viable temperature range.) These 
comparisons, as well as comparisons with the regional outron 
mapping of Morton and Blumenthal (2011), serve both to support 
the utility of the present data and to provide caution in any uni- 
versal interpretation of data from a limited set of developmental 
timing and growth conditions. Even if stable mRNA is present in 
the sample, these assays will only provide a positive outron signal 
based on recent transcription, specifically addressing the (likely 
transient) presence of the pre-trarcs-splicing nuclear precursor un- 
der the growth conditions of the assay. 

To provide valid outron-TSS candidates, we sought where 
possible to associate each with the body of the corresponding gene. 
Exon-TSS assignments have the advantage of potential validation 
through a continuous set of RNA-seq reads that go into the rele- 
vant mRNA. To further evaluate outron-TSS assignments, we per- 
formed paired-end sequencing of the embryonic library, generat- 
ing 42.2 million paired-end reads of 5 '-end RNA fragments; 23.1 
million pairs mapped to unique positions in the genome (Table 2). 
We then applied an analysis that checked whether the SL1 site in 
a gene was sandwiched by the aligned paired-end reads, providing 
support for the assignment of an outron-TSS to a gene body asso- 
ciated with an SL1 site as shown in Figure 1C. One might be 
concerned with the difficulty of linking outron-TSSs to SL1 sites 
because the distance between paired-end reads (—250 bp being the 
cDNA insert size in our experiment) could be smaller than the 
length of typical outrons. Examining the frequency distribution of 



Table 2. Statistics of sequenced paired-end reads for TSSs from 
embryo nuclei and their alignment to the worm genome 
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mapped 


mapped 
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Embryo 


42,232,142 


37,066,515 


23,145,796 


10,029,550 


(paired-end) 











the distances between the SL1 sites and the most abundant rep- 
resentative outron-TSSs (defined by single-end sequencing) shows 
that, although many TSSs were located at >250 bp upstream of 
corresponding SL1 sites, 40.3% of 7351 TSSs were seen within 250 
bp of SL1 sites (Fig. 1H), indicating that our paired-end sequencing 
data will be useful in linking many TSSs to associated gene bodies. 

In Figure 1H, we find that lengths of outrons, the region be- 
tween the TSS and the trans-splice site, vary greatly, from a 90th 
percentile of 2099 to a 10th percentile of 59, with a median length 
of 369. Whereas the long outrons are similar in length to long 
introns in worms, the putative outrons defined by these data were 
in some cases much shorter than introns can be, consistent with 
the idea that the constraint on the lower limit of intron length 
results from a minimum distance between the two splice sites 
needed for spliceosome assembly. While the possibility of short 
outrons is intriguing, we note that the type of data acquired in this 
work (a static picture of 5 '-end structures) does not provide in- 
formation about precursor-product relationships. Apparent short 
outrons, based on the distance from the predominant TSS to the 
trans-splice site, may lead to misinterpretation. The true outrons 
may derive from initiation events upstream. Indeed, we observed 
upstream TSSs in 118 (76%) among 154 genes for which the pre- 
dominant TSS was within 20 bp upstream of the SL1 site (Sup- 
plemental Table S2). In the remaining 36 genes, because such 
outrons may be processed efficiently, these upstream TSSs might 
either not appear in our data or appear as rarely used sites. Con- 
versely, the outron sites that appear in nuclear RNA would have 
been enriched for slower 5' processing, potentially enriching the 
signal for any short outrons that are spliced inefficiently. 

An intriguing diversity among initiation elements comes in 
the degree to which initiation is specified at a single position in the 
DNA sequence versus a range of positions. Classifiers for peak 
breadth have been applied extensively in such analyses, including 
previous studies on human, mouse, and fruit fly genomes: In each 
case, once a set of reference TSSs was determined, shapes of TSS 
distributions in promoters of individual reference TSSs would be 
assigned (Carninci et al. 2006; Frith et al. 2006; Ni et al. 2010; Rach 
et al. 2011). We classified C. elegans TSS distributions into three 
types: NP (Narrow with Peak), BP (Broad with Peak), and WP (Weak 
Peak) according to the definitions proposed by Ni et al. (2010). The 
precise category fractions varied somewhat depending on the 
minimum threshold on the size of the TSS cluster (e.g., 10, 25, 50, 
75, and 100). For start sites with modest-to-high expression 
(thresholds of more than 25 reads observed) started, WP-type 
shapes were dominant (Fig. II). This observation is consistent with 
WP dominance in other studies of human, mouse, and fruit fly 
genomes (Carninci et al. 2006; Frith et al. 2006; Ni et al. 2010; Rach 
et al. 2011). TSS distribution of WP type frequently ranged over 
a long region (typically of length >500 bp) and had multimodal 
peaks; for some analyses, we thus decompose each WP TSS distri- 
bution into Gaussian peaks to examine characteristics as noted 
below. 
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Nucleosome and Pol II positioning around TSSs and SL1 sites 

Some types of modified histones and RNA Pol II are known to ac- 
cumulate at promoter regions (Whittle et al. 2008; Baugh et al. 
2009), with information related to this accumulation having been 
used to approximate the locations of transcription initiation sites. 
To check whether the TSS collection defined in this study shares 
these characteristics with other species, we examined the publicly 
available data of nucleosome positioning estimated from MNase 
digestion of whole chromatin (Valouev et al. 2008) or of chromatin 
pre-selected for its association with the histone modification mark 
H3K4me2/3 (Gu and Fire 2010), also examining Pol II positioning 
(Gerstein et al. 2010) surrounding representative TSSs. Using either 
representative TSSs (Fig. 2A,B) or the full Gaussian peak collection 
(Supplemental Fig. S2A,B), we observed arrays of positioned and 
phased nucleosomes, with an apparent nucleosome-depleted re- 
gion immediately upstream of TSSs. The upstream nucleosome 
depletion is a conserved feature that has also been reported in 
mammals and other eukaryotes (Mavrich et al. 2008; Kaplan et al. 
2009; Sasaki et al. 2009; Weiner et al. 2010; Valouev et al. 2011). 
The characteristic features of nucleosome positioning relative to 
representative TSSs were also reconfirmed by comparison with 
H3K4me2/3 data (Supplemental Fig. S2D). The period of the nu- 
cleosome positioning around representative TSSs was estimated 
as —160 bp, as evaluated using an autocorrelation measure (Fig. 
2D). The ~160-bp spacing around representative TSSs was smaller 
than the ~175-bp spacing in the entire genome (Valouev et al. 
2008). Nucleosome positioning scores were more pronounced for 



representative TSSs of higher expression (Fig. 2A,B). These obser- 
vations indicate that transcription correlates with more consistent 
packing of nucleosomes. This observation is consistent with bar- 
rier nucleosome models (Mavrich et al. 2008) in which Pol II stalled 
at the TSSs potentially serves as a barrier that can facilitate packing 
of nucleosomes downstream from TSSs. Indeed, Figure 2E exhibits 
Pol II enrichment at —170 bp and —100 bp downstream from 
representative outron/exon-TSSs, respectively. Interestingly, the 
distance between TSS and maximum Pol II enrichment differed 
between outron-TSS sites (—170 bp) and exon-TSSs (—100 bp), 
respectively. Although the mechanistic basis for this difference was 
not clear, the existence of a difference suggested the possibility that 
the biological system would have the opportunity, if needed, to 
distinguish functionally between the two types of 5' end. 

It has been shown that nucleosomes have a tendency to 
associate with DNA sequences exhibiting high G/C nucleotide 
and dinucleotide content, while being excluded from DNA se- 
quences exhibiting high A/T nucleotide and dinucleotide content 
(Bernstein et al. 2004; Field et al. 2008; Hughes and Rando 2009; 
Segal and Widom 2009; Tillo et al. 2010). We therefore examined 
the correlation of this underlying nucleotide signal to nucleosome 
positioning around newly identified representative TSSs (Fig. 3A,B) 
and around the full collection of Gaussian peaks (Supplemental Fig. 
S2E,F); in both analyses, we observed an unexpected phenomenon, 
a significant peak of C/G rate and a remarkable valley of A/T rate 
around —100 bp upstream of outron- and exon-TSSs, corresponding 
to a site where observed nucleosomes were depleted. 
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Figure 2. (A,B) The distributions of nucleosome positioning (local dyad positioning score) around representative outron-TSSs (A) and exon-TSSs (B) of 
expression levels >1 0 and >1 00 (see the distribution around Gaussian TSSs in Supplemental Fig. S2A,B). From these comparisons, note that the periodic 
pattern of nucleosome positioning was of substantially greater amplitude for TSSs of greater expression. (C) Nucleosome positioning (local dyad posi- 
tioning score) around SL1 sites and 5' ends of exons. We used all available SL1 sites without considering their expression levels. (D) Autocorrelation 
analysis: Let s(x) denote the local dyad positioning score. The autocorrelation defined by R(L)= f^ 03 s(x+L)s(x)dx was calculated around representative 
outron/exon-TSSs, SL1 sites, and 5' ends of exons, where L is displayed in thex-axis. Sharp peaks at — 1 60 bp for outron-TSSs and broad peaks surrounding 
— 160 bp for exon-TSSs highlight that arrays of positioned and phased nucleosomes are more prominent around outron-TSSs than around exon-TSSs. 
Broad peaks at —140 bp are also seen for SL1 sites and 5' ends of exons. (f) Pol II occupancy around representative outron/exon-TSSs of expression level 
>10, SL1 sites, and 5' ends of exons. 
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Figure 3. (A,B) A/T and C/G dinucleotide incidence rates around representative outron-TSSs (A) and exon-TSSs (B) of expression levels >1 0 and >1 00 
(see the distribution for Gaussian TSSs in Supplemental Fig. S2E,F). AT richness immediately upstream of outron-TSSs is due to the requirement for SL1 
splicing motif "TTTTCAG." In both types of TSSs, a significant valley in the A/T rate and a significant peak in the C/G rate were observed —100 bp upstream 
of TSSs. (C) A/T and C/G dinucleotide incidence rates around SL1 sites and 5' ends of exons. In all graphs, a running average over a 20-bp window is shown 
for smoothing lines. 



Nucleosome positioning became more prominent around 
outron-TSSs than around exon-TSSs (Supplemental Fig. S2A ; B), 
suggesting that a higher Pol II signal and a more pronounced array 
of positioned nucleosomes were associated with trans-splicing ac- 
tivity. We therefore examined nucleosome positioning around SL1 
sites and found that nucleosome cores were enriched —40 bp 
downstream from SL1 sites (Fig. 2C). One possibility was that this 
enrichment might reflect a combination of the distance distribu- 
tion between TSSs and SL1 sites and the phased array of nucleo- 
somes downstream from TSSs; however, this hypothesis was not 
supported by our statistical analysis (Supplemental Fig. S2G,H). 

As an alternative and more direct link, we considered the 
possibility that genomic signals for splicing machinery and nu- 
cleosome constraint might be associated. Such an association has 
been observed at ds-spliced sites (Kolasinska-Zwierz et al. 2009), 
hence an expectation that a disambiguation of TSSs and SL1 
spliced sites might show such an effect. Consistent with such an 
expectation, nucleosome positioning around 5' ends of exons was 
quite close to that around SL1 sites (Fig. 2C). Furthermore, we also 
observed RNA Pol II enrichment at —100 bp downstream from SL1 
sites and 5' ends of exons (Fig. 2E), and a strong correlation of A/T 
and C/G dinucleotide rates to the nucleosome enrichment 
downstream from SL1 sites and 5' ends of exons (Fig. 3C). These 
observations suggest an available confluence in biological regula- 
tion in which shared genomic signals and underlying de- 
pendencies on the underlying sequence can be used to coordinate 
nucleosome positioning, splicing, and polymerase initiation. 

Evolutionarily conserved sequence motifs around TSSs 

The large number of newly identified TSSs allowed us to search 
regions surrounding the TSSs for sequence motifs that might serve 
as ds-regulatory elements, in particular, transcription factor bind- 
ing sites. Genomic regions ranging from -300 bp to +50 bp of 
representative TSSs (and in a separate analysis around the full 
collection of Gaussian TSS peaks) were selected as putative pro- 
moters. Both analyses were of use, although we note with the full 
Gaussian-TSS collection that multiple closely spaced TSSs in this 
collection would allow some single ds-elements to be multiply 
counted, providing some impetus to mine the representative TSS 
collection preferentially. Candidate promoter regions were ana- 
lyzed using MEME (Bailey and Elkan 1994), a widely used program 
that identifies over-represented sequence motifs within a set 
of input sequences (see Methods). We used MEME to search for 



statistically significant motifs of <10 bp (£-value < 10~ 20 ), noting 
that these computationally predicted motifs may not necessarily 
be functional despite their high statistical significance. 

The evolutionary conservation of a motif is one of the 
strongest lines of evidence for its functional importance. We 
therefore assessed the conservation of each motif occurrence with 
respect to the average conservation rate among the genomes of six 
related nematode species for all positions in the occurrence (see 
Methods). Seventeen evolutionarily conserved motifs were de- 
tected in this analysis; of these, nine were previously known, and 
the remaining eight were novel (Fig. 4A). SP1 is a family of tran- 
scription factors involved in transcriptional control in several 
different systems. Figure 4B illustrates the frequency distribution 
of SP1 motif (Huber et al. 1998; Xi et al. 2007) occurrences around 
representative TSSs. Highly conserved SP1 motif occurrences were 
significantly enriched in promoter regions of representative out- 
ron-TSSs (P = 1.21 X 10" 33 ) and exon-TSSs (P = 4.18 X 10" 14 ) 
according to the Wilcoxon rank-sum test (see Methods). Since all 
motifs were of length <10, the number of all possible <10-mers 
approximates (4/3) x 10 6 [=-(4/3) x 4 10 ]. To reduce the possibility 
of selecting false-positive motifs from <10-mers, we performed 
multiple hypothesis testing using the Bonferroni correction to 
check each motif at a significance level of 5% divided by (4/3) X 
10 6 (=3.75 X 10" 8 ). 

Of note, the SP1 motif was "orientation independent" in the 
sense that occurrences of its reverse complement were enriched 
and conserved around both outron-TSSs and exon-TSSs (Fig. 4C). 
Other examples of orientation-independent motifs were six vari- 
ants of the CGCG motif. These motifs were frequently observed in 
CG-rich regions for both outron-TSSs and exon-TSSs, were prefer- 
entially located 20-80 bp upstream of outron-TSSs, and were sig- 
nificantly conserved around outron-TSSs (Supplemental Fig. S3). 
The evolutionary conservation of a motif and its positional en- 
richment at a specific location upstream of TSSs are both potential 
indicators of the functional importance of these motifs. Although 
C. elegans has several SP1 homologs, it is not clear if the promoter- 
proximal motif represents binding of one of these factors or of 
another unrelated factor with similar sequence specificity. 

Another well-known motif found to be enriched in putative 
promoters was the TATAA-box (Lifton et al. 1978; Goldberg 1979; 
Basehoar et al. 2004). TATAA motifs are highly conserved promoter 
elements shared by all eukaryotes, are typically found closely up- 
stream of TSSs, and are correlated with low-CG-content promoters, 
narrow distributions of TSSs, and a larger variability in expression 
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Figure 4. Conserved sequence motifs upstream of TSSs. (A) List of known motifs identified for outron- and exon-TSSs, and a list of novel motifs. (B) 
Frequency distribution of motif occurrences and their conservation around representative TSSs. (The upper histogram) The frequency distribution of motifs 
around outron-TSSs with downstream SL1 sites; (the lower histogram) the motif distribution around exon-TSSs without SL1 sites. The conservation rate of 
a motif occurrence is the average conservation rate of all positions in a given instance, where the conservation rate in the worm genome was calculated 
using six allied species (see Methods). Conservation rates ranging from 0 to 1 were divided into five levels. Darker colors indicate higher conservation rate; 
for example, the darkest color denotes a range from 0.8 to 1 . In the figure, an SP1 motif (Xi et al. 2007) was observed around outron-TSSs and exon-TSSs. 
The SP1 motif was significantly conserved around a strong peak —50 bp upstream of the TSS, and P-values were calculated for representative TSSs from the 
Wilcoxon rank-sum test (Methods). (C) The SP1 motif is orientation independent in the sense that its reverse complement was observed and conserved 
around outron/exon-TSSs. Motifs were examined upstream of TSSs collected from two samples, embryos and adults. The distributions from the two 
samples tend to be similar; thus, the graphs for embryos are displayed exclusively hereafter unless significant differences were observed between the 
distributions of the two samples. (D) Distribution of TATAA-box occurrences around representative TSSs. TATAA-box occurrences were enriched signif- 
icantly upstream of both types of TSSs according to the rank-sum test (Methods). (£) Distribution of TATAA-box around Gaussian TSSs. (F) A TATAA-box 
occurs upstream of the representative outron-TSS for NM_058327 (taf-4) that is encoded in the plus strand. The outron is displayed as a yellow box. The 
information is available at http://wormtss.utgenome.org/ by making a query of taf-4. 



(Blake et al. 2003; Raser and O'Shea 2004; Carninci et al. 2006; 
Landry et al. 2007; Yang et al. 2007). We found (Fig. 4D) that 
TATAA-box motifs showed high conservation across the six nem- 
atode species and were significantly enriched 30-40 bp upstream 
of both representative outron-TSSs (P = 9.26 X 10~ 16 ) and exon- 
TSSs (P = 6.83 X 10" 29 ). The degree of TATAA-box enrichment 
appears to differ between representative outron-TSSs and repre- 
sentative exon-TSSs in Figure 4D, leading to a concern that TATAA- 
box occurrences around the full collection of Gaussian outron- 
TSSs might have been overlooked. Indeed, more TATAA-box oc- 
currences were seen around the full collection of Gaussian outron- 
TSSs as well as around the full collection of Gaussian exon-TSSs 



(Fig. 4E), although most of them were less conserved. Figure 4F 
presents the promoter region for the taf-4 gene (TBP-associated 
transcription factor, refseq ID NM_058327). taf-4 features a 240-bp 
outron, indicated by the region between the representative TSS 
and the SL1 site, and has a TATAA-box motif located 31 bp up- 
stream of its TSS. The TSS data add to previous analysis indicating 
an upstream TATAA in C. elegans (Grishkevich et al. 2011), pro- 
viding a demonstration of TATAA-box preference at a specific lo- 
cation upstream. 

Supplemental Figure S4 shows several other previously 
described motifs that we identified upstream of C. elegans TSSs. 
T-blocks (stretches of TTT or more Ts) are known to occur 
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frequently at 50-100 bp upstream of a start codon, with longer 
T-block occurrences correlated with lower nucleosome occu- 
pancy and a higher expression level (Grishkevich et al. 2011). 
Using our TSS information, we found that T-blocks and A-blocks, 
the reverse complement of T-blocks, were significantly con- 
served around outron-TSSs, while they were not conserved 
around exon-TSSs. The motif associated with Initiator (Itr) 
binding ([CT][CT]A.[AT][CT][CT]) (Smale and Baltimore 1989) 
occurred frequently and was enriched around outron-TSSs. As 
expected, splice acceptor enrichment was observed proximal to 
outron-TSS sites. 

In contrast, Supplemental Figure S5 presents known motifs, 
most of which were identified in the human or fruit fly genomes 
but were less evident in the worm genome. BRE, a TFIIB recogni- 
tion element, was reported to be located immediately upstream of 
TATAA elements in the human genome (Lagrange et al. 1998); 
however, this tendency was not evident in the worm genome. BRE 
occurrences were conserved significantly, but the number of BRE 
occurrences was fairly small. A DPE (downstream promoter ele- 
ment) was identified in the fruit fly genome (Burke and Kadonaga 
1996) but was not enriched as a downstream promoter element in 
the worm genome (Supplemental Fig. S5). A DCE (downstream 
core element) (Ohler et al. 2002) was investigated by checking the 
appearance of one of the three short motifs CTTC, CTGT, and 
AGC, and was seen ubiquitously (Supplemental Fig. S5). An MTE 
(Motif Ten Element) that was reported to be conserved from fruit 
fly to humans (Lim et al. 2004) was not identified as enriched 
in the worm promoterome. 

In addition to these previously identified sequence motifs, we 
identified eight novel motifs that were statistically enriched 
proximal to worm TSSs and also showed evidence for evolutionary 
conservation (Fig. 4A). Six of these motifs may have usage or roles 
specific to outron-TSSs because they were conserved more signifi- 
cantly around outron-TSSs than around exon-TSSs. The six motifs 
were each orientation independent and likely to occur in narrow 
regions, usually —50 bp upstream of outron-TSSs (Fig. 5). These 
motifs were modestly frequent around representative TSSs as well 
as around Gaussian TSSs. We also noticed that the novel motifs 
were prevalent in bidirectional promoters (discussed later), sug- 
gesting their functional relevance to transcription factor-binding 
sites. Two less frequent, orientation-dependent motifs were found 
upstream of exon-TSSs (Supplemental Fig. S6). 

Sequence analysis alone cannot conclusively demonstrate 
that individual motifs are transcription-factor (TF) binding sites. 
To this end, the worm modENCODE project (Gerstein et al. 2010) 
provided an informative list of 12 candidate motifs that were 
observed in proximity to transcription-factor (TF) binding sites 
HLH-1, LIN-13, MEP-1, and PHA-4. Of note, the list includes 
sequences that are highly similar to four of our novel motifs 
(BiSLl-A, BiSLl-F, reverse complements of BiSLl-D and BiSLl-C 
in our notation), motivating the experimental validation of other 
novel motifs as TF-binding sites. We analyzed the frequency dis- 
tribution of distances between binding sites of each TF and newly 
determined representative TSSs, observing that binding sites of 
each TF were enriched 50-100 bp upstream of representative TSSs 
(Fig. 6A). In addition, we examined the frequency distribution of 
distances between binding sites of each TF and positions of each 
sequence motif. We found that the positions of each of four 
motifs (BiSLl-A, BiSLl-C, BiSLl-D, and BiSLl-F) significantly 
accorded with those of each TF (Fig. 6B; Supplemental Fig. S7, P < 
10~ 18 ). They would be good starting points for future functional 
analysis. 



Motifs associated with tissue-specific gene expression 

Motifs identified upstream of TSSs might have a fundamental role 
as ds-elements that regulate expression of genes in particular tissue 
types. We therefore examined the association between each motif 
and the expression of its associated gene (defined as the nearest 
downstream gene). As a measure of tissue-specific expression, we 
used the publicly available LongSAGE data, a comprehensive col- 
lection of SAGE data derived from a variety of tissues (e.g., oocytes, 
embryos, gut, muscle, hypodermal, neurons, neural cells, pharynx 
cells, and gonad: British Columbia C. elegans Gene Expression 
Consortium, http://elegans.bcgsc.bc.ca/) (Jones et al. 2001; McKay 
et al. 2003; Blacque et al. 2005; Khattra et al. 2007; McGhee et al. 
2007; Meissner et al. 2009; Wang et al. 2009). Two thousand three 
hundred and forty-two (2342) genes have SAGE data as well as at 
least one of our identified motifs upstream of their TSS. To test 
whether the presence of a particular motif tends to be associated 
with higher expression (represented by steady-state mRNA levels) 
in a given tissue compared with genes lacking the motif, we per- 
formed Wilcoxon rank-sum tests for assessing the null hypothesis 
that distributions of expression levels from the two sets of genes 
were equivalent. Figure 6C and Supplemental Figure S8 show the 
results of this analysis. We performed multiple hypothesis testing 
to check each motif at a significance level of 3.75 X 10~ 8 as we 
examined the significance of the conservation of a motif. P-values 
less than the significance level are highlighted in red in Figure 6C. 
Notably, the presence of the novel BiSL-A motif upstream of 
a gene's TSS was associated with significantly higher expression in 
oocytes, embryos, gut, muscle, hypodemal, pharynx, and gonad 
samples. Genes with CpG motif 4 in their putative promoters 
showed elevated gene expression in neuronal cells. High expres- 
sion in gonad cells was associated with the presence of several 
upstream motifs, including AAAA, BiSLl-A, CpG motif 1-2, and 
SL1. The statistical association of several of our identified motifs 
with tissue-specific patterns of mRNA expression indicates that 
these sequences may be functionally important in proper gene 
expression and are strong candidate transcription factor binding 
sites. 

Potential bidirectional promoters 

Bidirectional promoters provide a compact mechanism for the 
joint transcriptional control of two coding regions. A compre- 
hensive catalog of potential bidirectional promoters has not been 
available for C. elegans, due to limits in our knowledge of TSSs. Our 
comprehensive collection of TSSs together with information about 
gene expression levels provides a means of identifying candidate 
bidirectional promoters. Examining the candidate bidirectional 
promoters (those associated with TSSs 60-160 bp of each other) 
revealed that many of the identified motifs were frequently ob- 
served in these regions (Fig. 7; Supplemental Fig. S9). This class of 
C. elegans promoter should be of great interest for future experi- 
mental analysis. 

Discussion 

The study provides a comprehensive collection of TSSs for 7351 
trans-spliced genes in C. elegans such that their TSSs and gene 
bodies were linked by substantial paired-end reads, confirming the 
validity of the TSSs for the corresponding trans-spliced genes. 

Transcription of many genes initiates predominantly at 
a single site. This could be due to selection for a single, regulatable 
promoter or for a single defined 5' UTR. If the selection is at the 
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Figure 5. Frequency distributions of six novel orientation-independent motifs around representative and Gaussian outron/exon-TSSs. Individual motifs 
are given new identifiers BiSLI-A-F. The distributions of the reverse complements of individual motifs are shown in the last two columns. P-values 
according to the Wilcoxon rank-sum test (see Methods) are shown for the distributions around representative outron/exon-TSSs. 



level of the UTR, we might expect that trans-splicing would elim- 
inate that pressure, since the upstream region is replaced by 
a uniform 22-nt UTR. To address this question with the current 
data, we analyzed the average numbers of Gaussian outron/exon- 
TSSs in the regions of the same length, 3000 bp, upstream of SL1 
sites for outron-TSSs and upstream of genes for exon-TSSs. We 
observed that the outron-TSS average is significantly larger than 
the exon-TSS average (Supplemental Fig. 10A) according to the 



z-test (P- value < 10~ 116 ). In addition, the analysis of the distribu- 
tions of the numbers of outron- and exon-TSSs per gene also 
showed that the outron-TSS distribution was significantly greater 
than the exon-TSS distribution (Supplemental Fig. 10B) according 
to the Wilcoxon rank-sum test (P- value < 10~ 117 ). 

These observations are consistent with models in which the 
presence of trans-splicing might remove selective pressure for 
having a single TSS. In addition to potential differences in selective 
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pressure, several conceivable mechanistic differences might serve 
as a basis for the observed difference in dispersion between exon- 
and outron-TSS distributions (e.g., subtle differences in sequence 
composition or chromatin structure between outron and upstream 
exon regions). The combination of mechanistic and selective dif- 
ferences in the two 5' regions should certainly become clearer as 
additional analytic and experimental tools are applied to C. elegans 
promoters. While sequence composition and technical influences 
cannot be ruled out, these observations are intriguingly consistent 
with the possibility that trans-splicing may afford an advantage 
to the organism in permitting flexibility of 5' gene structures 
(Blumenthal 2005). 

Arrays of positioned and phased nucleosomes have been seen 
around TSSs in mammals and other eukaryotes (Mavrich et al. 
2008; Kaplan et al. 2009; Sasaki et al. 2009; Weiner et al. 2010; 
Valouev et al. 2011), with this study confirming this characteristic 
in C. elegans. In addition, nucleosome positioning and Pol II en- 
richment downstream from SL1 sites suggested a distinctive asso- 
ciation between signals for nucleosome positioning and signals for 
trans-splicing. In the regions upstream of newly identified TSSs, 
several evolutionarily conserved motifs were uncovered, including 
known motifs (e.g., TATAA-box and SP1) found in other species as 
well as six novel TSS-specific motifs with downstream SL1 sites. 
These discoveries led us to reveal 694 potential bidirectional pro- 
moters of —100 bp in length that harbor orientation-independent 
motifs such as CpG, SP1, and novel motifs. We hope that this data 
set will serve as a valuable resource in functional genomics studies 
for C. elegans. 

Methods 

5'-End library construction 

Figure 1A illustrates the process of 5 '-end library construction 
(Hashimoto et al. 2009). C. elegans nuclei were prepared by dis- 
ruption of animals and sedimentation of lysates through a 2 M 
sucrose cushion (lysis and gradient reagents from Sigma-Aldrich, 
catalog no. NUC-101). We started with ~2 mL of worm grind 
(adult or embryo) for each nuclear preparation, with purified nu- 
clei stored in 200 |xL of the NUC-101 storage buffer. We used TRIzol 
for the isolation of total RNA, with subsequent cleaning using the 
RNeasy Mini (QIAGEN). We then replaced the cap structure of 
mRNA with an Illumina RNA linker according to the oligo-cap 
method (Maruyama and Sugano 1994) as follows: The purified 
total RNA (starting with 5 |xg) was treated with bacterial alkaline 
phosphatase (TaKaRa). The RNA was extracted twice with phenol: 
chloroform (1:1), ethanol-precipitated, and treated with tobacco 
acid pyrophosphatase (TAP). Then a RNA linker was ligated using 
RNA ligase (TaKaRa): a 5'-oligo (5'-ACAGGUUCAGAGUUCUAC 
AGUCCAGCAG-3') linker. After linker ligation, the RNA was de- 
pleted of ribosomal RNA using the RiboMinus Eukaryote Kit (for 
RNA-seq, Invitrogen, catalog no. 10837-08). After removing ribo- 



somal RNA, a cDNA library was produced from 10 pmol of random 
hexamer primer (5 ' -CAAGCAGAAGACGGCATACGACAGCAGN 
NNNNNC-3') using Superscript III (Invitrogen) by incubating for 
10 min at 26°C, 90 min at 50°C, and 15 min at 70°C. After first- 
strand synthesis, RNA was no longer needed and was degraded 
in 15 mM NaOH for 1 h at 65°C. cDNA was amplified in a volume 
of 100 |xL using PCR with 16 pmol of 5' (5 ' -AATGATACGGCGA 
CCACCGACAGGTTCAGAGTTCTACAGTCC-3 ') and 3' (5'-CAAG 
CAGAAGACGGCATACGA-3 ') PCR primers. The cDNA was pro- 
duced using 15-20 cycles of 1 min at 94°C, 1 min at 58°C, and 
10 min at 72°C. The PCR fragments were size-fractionated by 8% 
polyacrylamide gel electrophoresis, and the fraction of 150- 
300 bp was recovered. The quality and quantity of the obtained 
single-stranded first-strand cDNAs were assessed using Bio- 
Analyzer (Agilent). 

Read alignment 

To map the short reads of Illumina GAII to the reference genome of 
C. elegans, we used BWA with the default setting (4% of mis- 
matches of read lengths were allowed). Many reads were elimi- 
nated unless they contributed to the TSS identification. For ex- 
ample, several trans-spliced reads failed to be mapped because of 
splice-reader sequences (SL1, SL2, etc.) attached to the head of the 
sequences after transcription. These unmapped reads were useless 
in identifying TSSs and were therefore discarded. Despite applica- 
tion of the RiboMinus Eukaryote Kit, we found that one-third of 
the reads were fragments of rRNA. These numerous rRNA reads 
simply reconfirmed rRNA regions that were well annotated in 
WormBase. 

TSS peak calling 

After mapping of the short-read sequences to the reference ge- 
nome, we sorted reads in the order of the genome coordinate and 
then computed the number of reads mapped to the same position 
to collect candidate TSS peaks. The distributions of the peaks 
showed various shapes; single-peak, multiple peaks, broadly dis- 
tributed peaks, etc. To classify these peak types, we attempted to fit 
a Gaussian mixture model to our data. One issue was to determine 
the number of Gaussian distributions. We used X-means clustering 
(Pelleg and Moore 2000) of peaks that used Gaussian model and 
penalty score to avoid overfitting (Bayesian-information criteria, 
BIC), so that the result of X-means clustering is expected to have 
a nice fit to a Gaussian-mixture model representing the observed 
peaks. The observed number of clusters before genes ranged from 
1 to 70. These clusters contain noisy peaks that may inadequately 
distort the Gaussian distributions. Thus, for each cluster, we 
trimmed marginal peaks (less than 5 tag counts) and applied the 
Gaussian distribution so as to compute N (gene expression level) 
and a (distribution shape) values. TSSs within 10 bp from the SL1 
sites were removed to avoid confusion with the trans-spliced 
tags. We selected clusters with the highest expression level within 



Figure 6. Motifs associated with transcription factors and tissue-specific genes. (A) For transcription factors (TF), HLH-1 , LIN-1 3, MEP-1 , and PHA-4, we 
obtained TF-binding scores (ChlP-seq signals) from the modENCODE database (Gerstein et al. 201 0), calculating the average TF-binding score around the 
representative TSSs. The TF-binding score is enriched 80-50 bp upstream of representative exon-TSSs as well as outron-TSSs, respectively. (B) Frequency 
distribution of motif occurrences around positions associated with locally maximum HLH-1 binding scores. The frequency distributions for LIN-1 3, MEP-1 , 
and PHA-4 can be found in Supplemental Figure S7. Color shade represents conservation rates (see Fig. 4B; Methods). We checked the null hypothesis that 
two distributions of conservation rates of a motif in proximity to TF-binding sites and in the other nonpromoter genomic regions were equal, by using the 
Wilcoxon rank-sum test (see Methods). A lower P-value of a motif indicates a higher enrichment of the motif at the binding sites. (C) The rows present 
sequence motifs, and the columns show tissue types in the LongSAGE data, except for the last column that displays the number of genes associated with 
each sequence motif among 2342 genes. The table presents P-values of the Wilcoxon rank-sum test for assessing the null hypothesis that two distributions 
of expression levels of genes with the presence or the absence of the upstream motif were equal in each tissue. To highlight significantly low P-values 
according to the Bonferroni correction, we colored P-values < 3.75 x 1 0~ 8 red, and P-values ranging from 3.75 x 1 0~ 8 to 1 0~ 3 orange. 
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Figure 7. Potential bidirectional promoters. (A) Distribution of distances between paired proximal TSSs in opposite strands. The remarkable number of 
loci (total instances more than 600) with paired divergent TSSs at a distance of 60-1 60 bp suggests an abundance of potential bidirectional promoters. 
This range (60-1 60 bp) is consistent with the observation that motif sequences (candidate c/s-elements) are typically enriched —50 bp upstream of the 
TSSs. In reality, many motifs, including novel ones, were observed frequently in those potential regions. (£) Example of candidate bidirectional promoter 
region. Two expressed representative outron-TSSs in the opposite strands were separated by 1 1 5 bp. A novel motif (TTCTCTCGGA), named "BiSLI -B" in 
Figure 5 (respectively, its reverse complement), was identified —50 bp (—70 bp) upstream of the plus (minus) strand representative TSS. This is in 
accordance with the enrichment of the motif —30-50 bp (70-100 bp) upstream of the plus (minus) strand TSS obtained from our genome-wide analysis 
(see BiSLI -B in Fig. 5). Of note, the motif was highly conserved among the six allied species. The two outrons from the representative TSSs to their nearest 
downstream SL1 sites were different in length, demonstrating the difficulty of estimating outrons and TSSs solely from SL1 sites and the genomic 
sequence. (C) Another example of a bidirectional promoter. Two common motifs (SP1 and CpG motif 4) were highly conserved between two expressed 
outron-TSSs. Multiple highly expressed and isolated TSSs were seen in both strands, and the wide distribution of the occurrences of these motifs may 
account for the multiple outron-TSSs (SP1 in Fig. 4B and CpG motif 4 in Supplemental Fig. S3). Many more examples of regions harboring conserved 
motifs between expressed outron-TSSs were seen (Supplemental Fig. S9), which could function as potential bidirectional promoters (696 candidates in 
Supplemental Table S4). Expression levels between the pairs of genes have a very small similarity (Pearson's correlation coefficient <0.066). 



3000 bp before the gene start sites as representative TSSs. Sup- 
plemental Table SI gives a gene-by-gene list of representative TSSs 
as well as Gaussian TSSs. In the table, we used the gene annota- 
tions in WormBase WS220. We merged gene models that over- 
lapped in individual regions of the genome into unique groups 
(choosing the longest model in each case) so as to associate each 
TSS with a single neighboring entity. 

Finding motifs and their occurrences around TSSs 

Finding motif sequences is a computationally expensive task be- 
cause of numerous combinations of ACGT letters of various 
lengths. MEME is a widely used program for detecting motif se- 
quences. To speed up the analysis, we used an MPI version of 
MEME, and it took us —1 d to run the program using a cluster of 
machines equipped with 64 CPUs using the option "-p 64 -maxsize 
10000000 -dna -nmotifs 30 -mod zoops -minw 3 -maxw 10." Each 
letter in a motif is associated with bits that express a likelihood of 
the presence of the letter at the position in the motif. In searching 
all promoter regions (350 bp to -50 bp around TSSs) for a motif, we 
used letters associated with >1 bits in a motif for testing exact 
matches, while we replaced other letters by ?, a wild card matching 
any one letter. In this way, we transformed a motif into a regular 
expression. For example, we converted the SP1 motif in Figure 4B 
into the regular expression "C?CC?CC" and the TATAA-box in 
Figure 4D into "TATAAAA." We then searched all promoter regions 
for the matches of the regular expression corresponding to a motif. 
We note that the use of regular expressions to search for many 
known motifs was preferable over weight-based methods, because 



the relevant publications lacked bit-significance scores. To be 
consistent with MEME, for the motifs found by MEME, we also 
used regular expressions of characters whose bit-significance scores 
were >1. 

Measuring the significance of the conservation of a motif 
around TSSs 

We calculated the conservation rates of motifs at each position 
from a comparison of the genomes of six related nematode 
species, C. elegans (The C. elegans Sequencing Consortium 1998), 
Caenorhabditis brenneri (Stein et al. 2003), Caenorhabditis brigg- 
sae, Caenorhabditis remanei, Caenorhabditis japonica, and Pris- 
tionchus pacificus (Thomas 2008), with alignments derived from 
the UCSC Genome Browser resource (Siepel et al. 2005; Fujita 
etal. 2011). 

Subsequently, we tested the significance of the conservation 
of a motif in promoter regions that range from 300 bp to -50 bp of 
representative TSSs. If Gaussian TSSs are considered, the same 
motif occurrence can be counted more than once. To avoid this 
double counting, we focused on representative TSSs so as to asso- 
ciate any motif occurrence with its downstream representative TSS, 
uniquely. We checked the null hypothesis that two distributions 
of conservation rates of a motif in promoter regions and in the 
other nonpromoter genomic regions were equal, by using the 
Wilcoxon rank-sum test. The number of all occurrences of a focal 
motif in nonpromoter regions was huge, and using them all was 
likely to lead to incorrect calculation of P-values. We therefore 
selected 10,000 instances at random from all the occurrences, 
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and performed this trial 100 times to calculate P- values. Since some 
P-values happened to be extremely small or large, the median of 
the P-values was treated as the representative P-value. Because 
we considered motifs of <10-mers in length, according to the 
Bonferroni correction, we tested each motif at a significance 
level of 5% divided by (4/3) X 10 6 (=3.75 X 1(T 8 ), where (4/3) X 10 6 
approximates the number of all possible <10-mers. 

Data access 

All sequence data are deposited at the NCBI Sequence Read Archive 
(SRA) (http://www.ncbi.nlm.nih.gov/sra) under accession number 
SRA060670. Our genome browser is available at http://wormtss. 
utgenome.org/. Both the primary data and analytical tables 
in BAM and WIG format are also available at http://wormtss. 
utgenome.org/browser/download.jsp, so that public databases 
such as WormBase and UCSC can incorporate our data easily into 
their sites. 
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